Voice Conversion Using A Two-Factor Gaussian Process Latent Variable Model
Authors
Abstract
This paper presents a novel strategy for voice conversion that solves the style and content separation task using a two-factor Gaussian Process Latent Variable Model (GP-LVM). A generative model for speech is built on the interaction of style and content, which represent the speaker's individual voice characteristics and the semantic information, respectively. The interaction is captured by a GP-LVM with two latent variables and a GP mapping to the observations. Then, for a given collection of labelled observations, the separation task is accomplished by fitting the model by maximum likelihood. Finally, voice conversion is implemented by style alternation: the desired speech is reconstructed from the decomposed target speaker style and the source speech content, using the learned model as a prior. Both objective and subjective test results show the advantage of the proposed method over the traditional GMM-based mapping system when the training data is limited in size. Furthermore, experimental results indicate that the GP-LVM with nonlinear kernel functions performs better for voice conversion than the GP-LVM with linear ones, because it captures the interaction between style and content better, and that a rich variety of both factors in the training set also helps to improve conversion performance.

Summary. The article describes a new strategy for voice conversion that separates style and content using the two-factor GPLVM (Gaussian Process Latent Variable Model) method. The experiments indicate that the proposed algorithm performs better than the traditionally used GMM-type mapping system when the amount of data for testing is limited. The GPLVM is shown to have better voice conversion properties with a nonlinear kernel function than with a linear one. (Two-factor GPLVM method in the voice conversion process).
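The model outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the kernel or the optimisation details, so everything below is an assumption made for illustration: the product of a style RBF kernel and a content RBF kernel, the fixed noise level, the latent dimensions, the random toy data, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def two_factor_kernel(S, C, noise=1e-2):
    """Assumed two-factor covariance: style kernel times content kernel, plus noise."""
    return rbf(S, S) * rbf(C, C) + noise * np.eye(S.shape[0])

def neg_log_likelihood(S, C, Y, noise=1e-2):
    """Negative GP log marginal likelihood of observations Y (N x D).

    Fitting by maximum likelihood would minimise this quantity with
    respect to the style and content latents (not done in this sketch).
    """
    N, D = Y.shape
    K = two_factor_kernel(S, C, noise)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))  # K^{-1} Y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * D * logdet + 0.5 * np.sum(Y * alpha) + 0.5 * N * D * np.log(2.0 * np.pi)

def predict(S_train, C_train, Y_train, s_new, c_new, noise=1e-2):
    """GP posterior mean for new (style, content) latent pairs."""
    K = two_factor_kernel(S_train, C_train, noise)
    k_star = rbf(s_new, S_train) * rbf(c_new, C_train)
    return k_star @ np.linalg.solve(K, Y_train)

# Toy data: 2 speakers (styles), 3 content classes, 4-dimensional features.
rng = np.random.default_rng(0)
style = rng.normal(size=(2, 2))      # one latent vector per speaker
content = rng.normal(size=(3, 2))    # one latent vector per content class
spk = np.array([0, 0, 0, 1, 1, 1])   # speaker label of each observation
cls = np.array([0, 1, 2, 0, 1, 2])   # content label of each observation
Y = rng.normal(size=(6, 4))          # stand-in for spectral features

S, C = style[spk], content[cls]
nll = neg_log_likelihood(S, C, Y)

# "Conversion": pair the target speaker's style with the source content.
converted = predict(S, C, Y, style[[1]], content[[0]])
```

In this sketch the latents are drawn at random and only evaluated; a full fit would pass `neg_log_likelihood` to a gradient-based optimiser over `style` and `content`. A product kernel is one common way to let two latent factors interact, and replacing `rbf` with a linear kernel would give the linear variant that the abstract compares against.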
Similar resources
Using Context-based Statistical Models to Promote the Quality of Voice Conversion Systems
This article examines methods for optimizing the performance of GMM-based voice conversion systems, taking the GMM method as the baseline to be improved. In current methods, because a single conversion function is used to convert all speech units and statistical averaging then smooths the spectrum, we will observe quality ...
Efficient Modeling of Latent Information in Supervised Learning using Gaussian Processes
Often in machine learning, data are collected as a combination of multiple conditions, e.g., the voice recordings of multiple persons, each labeled with an ID. How could we build a model that captures the latent information related to these conditions and generalize to a new one with few data? We present a new model called Latent Variable Multiple Output Gaussian Processes (LVMOGP) and that all...
Voice Morphing Using the Generative Topographic Mapping
In this paper we address the problem of voice morphing. We attempt to transform the spectral characteristics of a source speaker's speech signal so that the listener would believe that the speech was uttered by a target speaker. The voice morphing system transforms the spectral envelope as represented by a Linear Prediction model. The transformation is achieved by codebook mapping using the Gen...
A Gaussian process latent variable model formulation of canonical correlation analysis
We investigate a nonparametric model with which to visualize the relationship between two datasets. We base our model on Gaussian Process Latent Variable Models (GPLVM)[1],[2], a probabilistically defined latent variable model which takes the alternative approach of marginalizing the parameters and optimizing the latent variables; we optimize a latent variable set for each dataset, which preser...
Speech-Driven Facial Animation Using a Shared Gaussian Process Latent Variable Model
In this work, synthesis of facial animation is done by modelling the mapping between facial motion and speech using the shared Gaussian process latent variable model. Both data are processed separately and subsequently coupled together to yield a shared latent space. This method allows coarticulation to be modelled by having a dynamical model on the latent space. Synthesis of novel animation is...